Predicting oil reserves and optimizing well placement using Machine Learning.

An oil mining company has reservoir data containing oil well parameters for selected basins/regions. As a data scientist, you have been hired to analyze the reservoir data and build a model that predicts reserves in new wells. You are tasked with optimizing well placement and maximizing profit. An important deliverable for this project is a risk analysis using the bootstrap technique.

Objective

Production forecasts and reserve estimates are essential inputs to decision-making and investment evaluation for any oil company. Oil companies and reservoir asset managers must factor in reserves, production forecasts, and estimated ultimate recovery when determining whether a production project will be viable and profitable. In addition to reservoir volume, operational risk is another important consideration for oil companies. To this end, we need to build a model that predicts the volume of reserves, find the best well placement, and maximize profit by picking the region with the highest total profit. The resulting model will serve as a basis for critical decisions during reservoir management and field development planning.

Conditions:

Import libraries

Load data

Data Check

From the general information about the dataset, we can see that the data has no missing values and no duplicate rows. In region 3, f1, f2, and f3 are all normally distributed. We observe a significant correlation between f2 and the product reserve in all regions.

Data preparation

The training sets from the different regions are joined so that the model is trained on more data points, which should improve its performance. Region membership is preserved in a region column, since the region may itself have some predictive power.

Test data was set aside for each region to evaluate model performance. Training data from the different regions was combined so that the model is trained on more data points and is more robust. The region was kept as a categorical variable and encoded using one-hot encoding (OHE). Numerical features were standardized.
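The preparation steps described above can be sketched roughly as follows. This is only an illustration on synthetic data: the column names (`f1`, `f2`, `f3`, `product`, `region`), the 25% test share, the random seeds, and the helper `make_region` are assumptions made for the example, not values taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

def make_region(region_id, n=1000):
    """Hypothetical stand-in for one region's table (synthetic, not the real data)."""
    return pd.DataFrame({
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
        "f3": rng.normal(size=n),
        "product": rng.uniform(0, 180, size=n),
        "region": region_id,
    })

# Split each region separately so every region keeps its own test set.
train_parts, test_sets = [], {}
for region_id in (1, 2, 3):
    df = make_region(region_id)
    train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
    train_parts.append(train_df)
    test_sets[region_id] = test_df

# Combine the regional training sets into one table.
train = pd.concat(train_parts, ignore_index=True)

# Standardize numeric features and one-hot encode the region label.
numeric_cols = ["f1", "f2", "f3"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("region", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

X_train = preprocess.fit_transform(train[numeric_cols + ["region"]])
y_train = train["product"].to_numpy()
print(X_train.shape)
```

Fitting the transformer on the training data only, and reusing it with `preprocess.transform` on each regional test set, keeps test information out of the scaling statistics.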

Build and train model

A linear regression model was built and trained on the training data. Model performance was evaluated using the cross-validated RMSE.
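A minimal sketch of cross-validated RMSE for a linear regression is shown here. The data is synthetic, and the number of folds (`cv=5`) is an assumption, since the text does not say how many folds were used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and target (illustrative only, not the reservoir data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 30.0 * X[:, 2] + 90.0 + rng.normal(scale=5.0, size=500)

model = LinearRegression()

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
mean_rmse = rmse_per_fold.mean()
print(f"Cross-validated RMSE: {mean_rmse:.2f}")
```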

The model has a validation RMSE of 38.16.

We also observe from the model's beta coefficients that the f2 feature has a strong impact on the product reserve. This confirms the trend seen during our data check, where f2 showed a significant correlation with the product reserve.

We also notice that the region is an important feature in predicting the product reserve.
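One way to read feature impact from a linear model is to fit it on standardized features and rank the coefficients by absolute value. The sketch below does this on synthetic data in which the target is driven mostly by `f2` by construction, so the ranking is illustrative and is not a reproduction of the notebook's result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
features = ["f1", "f2", "f3"]
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=features)

# Synthetic target: mostly f2, a little f1, no f3 (an assumption for the demo).
y = 25.0 * X["f2"] + 2.0 * X["f1"] + rng.normal(scale=3.0, size=400)

# Standardizing puts the coefficients on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

coefs = pd.Series(model.coef_, index=features)
ranked = coefs.abs().sort_values(ascending=False)
print(ranked)
```

Coefficient magnitudes are only comparable when the features share a scale, which is why standardization comes first.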

Model prediction

Region 3 has the highest average reserve per well. Our model also predicted this region to have the highest average oil reserve.

Well selections

Select 500 wells for the study in each region. The data was already shuffled during the train/test split.

The selected wells in region 2 have the highest average reserve and are hence the most profitable on average.
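The selection step can be sketched as follows: draw 500 wells for study, rank them by predicted reserve, and keep the best 200. The count of 200 developed wells comes from the conclusion of this report. The data and variable names are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic actual and predicted reserves for one region (thousand barrels).
n_wells = 5000
actual = rng.uniform(0, 180, size=n_wells)
predicted = actual + rng.normal(scale=20, size=n_wells)

# Study 500 wells, drawn without replacement.
sample_idx = rng.choice(n_wells, size=500, replace=False)

# Rank the studied wells by predicted reserve and keep the top 200.
order = np.argsort(predicted[sample_idx])[::-1]
selected_idx = sample_idx[order[:200]]

# Evaluate the choice on the actual reserves, not on the predictions.
total_reserve = actual[selected_idx].sum()
mean_reserve = actual[selected_idx].mean()
print(f"Total: {total_reserve:.1f}, mean per well: {mean_reserve:.1f}")
```

Ranking on predictions while scoring on actual values mirrors what would happen in practice, where only the predictions are available at selection time.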

Calculate risks and profit for each region

We observe that region 2 has the highest average profit. The risk evaluation also gives it a zero estimated probability of loss. This is supported by the profit distribution plot shown above, whose x-axis range contains no negative values.
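A rough sketch of a bootstrap profit and risk estimate is given below. It assumes each iteration resamples 500 wells with replacement and develops the 200 with the highest predictions, that the 95% interval is taken from the 2.5th and 97.5th percentiles, and that revenue is 4,500 dollars per thousand barrels. That price is not stated in the text; it is inferred from the break-even figures in the conclusion. The data is synthetic.

```python
import numpy as np

BUDGET = 100_000_000      # capital cost, from the conclusion
WELLS_DEVELOPED = 200     # wells developed per region, from the conclusion
WELLS_STUDIED = 500       # wells studied per region
REVENUE_PER_UNIT = 4_500  # ASSUMED dollars per thousand barrels

def profit(actual, predicted, n_best=WELLS_DEVELOPED):
    """Profit from developing the n_best wells ranked by predicted reserve."""
    best = np.argsort(predicted)[::-1][:n_best]
    return actual[best].sum() * REVENUE_PER_UNIT - BUDGET

def bootstrap_profit(actual, predicted, n_iter=1000, seed=0):
    """Resample wells with replacement and collect the profit of each draw."""
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.integers(0, len(actual), size=WELLS_STUDIED)
        profits[i] = profit(actual[idx], predicted[idx])
    return profits

# Synthetic test-set reserves and predictions for one region.
rng = np.random.default_rng(1)
actual = rng.uniform(0, 180, size=25_000)
predicted = actual + rng.normal(scale=30, size=25_000)

profits = bootstrap_profit(actual, predicted)
mean_profit = profits.mean()
ci_low, ci_high = np.percentile(profits, [2.5, 97.5])
risk_of_loss = (profits < 0).mean()  # share of draws with negative profit

print(f"Mean profit: {mean_profit:,.0f}")
print(f"95% interval: {ci_low:,.0f} to {ci_high:,.0f}")
print(f"Risk of loss: {risk_of_loss:.1%}")
```

A risk of loss of zero here only means that none of the 1,000 resamples produced a negative profit, which matches a distribution plot with no negative values on the x-axis.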

Conclusion

Data Prep

It was observed that the data does not have any missing values. We noted that each region contains 100000 rows and 3 features. We observe a significant correlation between f2 and the product reserve in all regions. A region column was included as a categorical feature to keep track of the region each well came from. The data was split into train and test datasets for each region.

Model training

The training sets for the regions were concatenated to train the model on a more robust dataset. The linear regression model was trained on the combined dataset, and its RMSE was evaluated through cross-validation. This gave a validation RMSE of about 38.

The model's beta coefficients show that the f2 feature has a strong impact on the product reserve. This confirms the trend seen during our data check, where f2 showed a significant correlation with the product reserve.

The beta coefficients also show that the region of the well is an important feature in predicting the product reserve.

Well / Region Selection

For a capital cost of 100 million to break even, the total reserve from the 200 selected oil wells must be at least 22223 thousand barrels. This corresponds to an average reserve of about 112 thousand barrels per well.
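The break-even figures can be checked with a short calculation. The revenue of 4,500 dollars per thousand barrels used below is an assumption: it is not stated in the text, but it is the value implied by 100 million divided by roughly 22223. Rounding up to whole units gives the figures quoted above. The unrounded per-well average is closer to 111.1.

```python
import math

BUDGET = 100_000_000      # capital cost for the region, from the text
WELLS = 200               # wells developed, from the text
REVENUE_PER_UNIT = 4_500  # ASSUMED dollars per thousand barrels

# Total reserve (thousand barrels) needed for revenue to cover the budget.
break_even_total = BUDGET / REVENUE_PER_UNIT

# Average reserve each of the 200 wells must reach.
break_even_per_well = break_even_total / WELLS

print(math.ceil(break_even_total))     # rounded up to whole units
print(round(break_even_per_well, 1))   # unrounded average, one decimal
print(math.ceil(break_even_per_well))  # rounded up to whole units
```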

From the study of 500 wells per region, the model was used to select the 200 wells with the highest reserve in each region. Profit was then calculated from the selected wells in each region. Region 2 has the highest total reserve for the selected 200 wells, about 23404.27 thousand barrels. It also has the highest average profit, 5.36 million dollars.

The well samples were resampled 1000 times using the bootstrap technique to obtain a distribution of the average profit in each region. Consistent with the earlier observation, Region 2 had the highest mean of the mean profits, with a 95% confidence interval for profit between 5.28 and 5.43 million.

The selected wells in Region 2 also have no risk of loss. The risks of loss in Regions 1 and 3 were about 5.2 and 7.1 percent, respectively.

Recommendation

Based on the results of this study, OilyGiant mining company should focus its oil well development activities on Region 2. Region 2 generated a higher average profit than the other regions and has the lowest risk of loss.